{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rO7X6VuWpHmG"
   },
   "source": [
    "<center>\n",
    "<p style=\"text-align:center\">\n",
    "<img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
    "<br>\n",
    "<br>\n",
    "<a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
    "|\n",
    "<a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "|\n",
    "<a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
    "</p>\n",
    "</center>\n",
    "<h1 align=\"center\">Model Comparison for an Email Text Extraction Service</h1>\n",
    "\n",
    "Imagine you're deploying a service that condenses emails into concise summaries. One challenge of using LLMs for summarization is that even the best models can miscategorize key details, or miss those details entirely.\n",
    "\n",
    "In this tutorial, you will construct a dataset and run experiments to engineer a prompt template that produces accurately summarizes your emails. You will:\n",
    "\n",
    "- Upload a **dataset** of **examples** containing emails to Phoenix\n",
    "- Define an **experiment task** that extracts and formats the key details from those emails\n",
    "- Devise an **evaluator** measuring Jaro-Winkler Similarity\n",
    "- Run **experiments** to iterate on your prompt template and to compare the summaries produced by different LLMs\n",
    "\n",
    "⚠️ This tutorial requires and OpenAI API key.\n",
    "\n",
    "Let's get started!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install arize-phoenix langchain langchain-core langchain-community langchain-benchmarks langchain-openai nest_asyncio jarowinkler openinference-instrumentation-langchain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NqEaZvJs6EdK"
   },
   "source": [
    "# Set Up OpenAI API Key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "if not os.getenv(\"OPENAI_API_KEY\"):\n",
    "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"🔑 Enter your OpenAI API key: \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "bSLxWyFV6EdK"
   },
   "source": [
    "# Import Modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import tempfile\n",
    "from datetime import datetime, timezone\n",
    "\n",
    "import jarowinkler\n",
    "import pandas as pd\n",
    "from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser\n",
    "from langchain_benchmarks import download_public_dataset, registry\n",
    "from langchain_openai.chat_models import ChatOpenAI\n",
    "from openinference.instrumentation.langchain import LangChainInstrumentor\n",
    "from openinference.instrumentation.openai import OpenAIInstrumentor\n",
    "\n",
    "import phoenix as px\n",
    "from phoenix.client import AsyncClient\n",
    "from phoenix.otel import register"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Z0WDhK_Y6EdK"
   },
   "source": [
    "# Launch Phoenix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "0yXTfs8lAdZh"
   },
   "source": [
    "First we have to set up our instance of Phoenix and our instrumentors to capture traces from our agent. We'll use both our Langchain and OpenAI auto instrumentors because while our task uses Langchain, our evaluation function will call OpenAI directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "px.launch_app().view()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, instantiate a Phoenix client. This acts as the main entry point for interacting with the Phoenix API and can be installed independently of Phoenix itself.\n",
    "\n",
    "Notice that we're using the AsyncClient class, allowing us to run experiments asynchronously. If you need to use Phoenix using synchronous code, you can use the Client class instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "px_client = AsyncClient()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yCaNah-F6EdK"
   },
   "source": [
    "# Instrument LangChain and OpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracer_provider = register(endpoint=\"http://127.0.0.1:4317\")\n",
    "LangChainInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)\n",
    "OpenAIInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "NzvJVOZuBw6e"
   },
   "source": [
    "# Experiments in Phoenix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ysmbKqO0BxXX"
   },
   "source": [
    "Experiments in Phoenix are made up of 3 elements: a dataset, a task, and an evaluator. The dataset is a collection of the inputs and expected outputs that we'll use to evaluate. The task is an operation that should be performed on each input. Finally, the evaluator compares the result against an expected output.\n",
    "\n",
    "For this example, here's what each looks like:\n",
    "*   Dataset - a dataframe of emails to analyze, and the expected output for our agent\n",
    "*   Task - a langchain agent that extracts key info from our input emails. The result of this task will then be compared against the expected output\n",
    "*   Eval - Jaro-Winkler distance calculation on the task's output and expected output\n",
    "\n"
   ]
  },
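  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the three pieces concrete, here is a minimal plain-Python sketch of the loop an experiment performs. It is not Phoenix's implementation: `toy_examples`, `toy_task`, `toy_exact_match`, and `run_toy_experiment` are hypothetical names and data invented purely for illustration, and the stand-in task uses simple string handling where the real task calls an LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statistics import mean\n",
    "\n",
    "# Hypothetical toy data: each example pairs an input with its expected output.\n",
    "toy_examples = [\n",
    "    {\"input\": \"From: Ana. Subject: Invoice\", \"expected\": {\"sender\": \"Ana\"}},\n",
    "    {\"input\": \"From: Bo. Subject: Lunch\", \"expected\": {\"sender\": \"Bo\"}},\n",
    "    {\"input\": \"Sender is Cy\", \"expected\": {\"sender\": \"Cy\"}},\n",
    "]\n",
    "\n",
    "\n",
    "def toy_task(input):\n",
    "    \"\"\"Stand-in for the LLM chain: read the sender from a 'From: ' prefix.\"\"\"\n",
    "    if not input.startswith(\"From: \"):\n",
    "        return {\"sender\": None}\n",
    "    return {\"sender\": input[len(\"From: \") :].split(\".\")[0]}\n",
    "\n",
    "\n",
    "def toy_exact_match(output, expected):\n",
    "    \"\"\"Stand-in evaluator: 1.0 when the output equals the expected value, else 0.0.\"\"\"\n",
    "    return 1.0 if output == expected else 0.0\n",
    "\n",
    "\n",
    "def run_toy_experiment(examples, task, evaluator):\n",
    "    \"\"\"Run the task on every input, then score each output against its expected value.\"\"\"\n",
    "    outputs = [task(example[\"input\"]) for example in examples]\n",
    "    return [evaluator(out, ex[\"expected\"]) for out, ex in zip(outputs, examples)]\n",
    "\n",
    "\n",
    "toy_scores = run_toy_experiment(toy_examples, toy_task, toy_exact_match)\n",
    "print(f\"per-example scores: {toy_scores}, mean: {mean(toy_scores):.2f}\")"
   ]
  },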
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xSLNHY6a6EdK"
   },
   "source": [
    "# Download JSON Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4tpAcUSmAvOD"
   },
   "source": [
    "We've prepared some example emails and actual responses that we can use to evaluate our two models. Let's download those and save them to a temporary file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_name = \"Email Extraction\"\n",
    "\n",
    "with tempfile.NamedTemporaryFile(suffix=\".json\") as f:\n",
    "    download_public_dataset(registry[dataset_name].dataset_id, path=f.name)\n",
    "    df = pd.read_json(f.name)[[\"inputs\", \"outputs\"]]\n",
    "df = df.sample(10, random_state=42)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9dneIPyl6EdK"
   },
   "source": [
    "# Upload Dataset to Phoenix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Jk3F8h9lA4MR"
   },
   "source": [
    "Next, we'll upload our dataset to Phoenix. Once this is present in Phoenix, we can run multiple experiments with different models on this one dataset, and compare their performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = await px_client.datasets.create_dataset(\n",
    "    name=f\"{dataset_name}{datetime.now(timezone.utc)}\",\n",
    "    inputs=df.inputs.tolist(),\n",
    "    outputs=df.outputs.map(lambda obj: obj[\"output\"]).tolist(),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "12cVoZ4t6EdK"
   },
   "source": [
    "# Set Up LangChain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "naKQx8TvBGno"
   },
   "source": [
    "Now we'll set up our Langchain agent. This is a straightforward agent that makes a call to our specified model and formats the response as JSON."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = \"gpt-4o\"\n",
    "\n",
    "llm = ChatOpenAI(model=model).bind_tools(\n",
    "    tools=[registry[dataset_name].schema],\n",
    "    tool_choice=registry[dataset_name].schema.schema()[\"title\"],\n",
    ")\n",
    "output_parser = JsonOutputFunctionsParser()\n",
    "extraction_chain = registry[dataset_name].instructions | llm | output_parser"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "BgMH06496EdL"
   },
   "source": [
    "# Define Task Function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "o1-BCl1yBh_T"
   },
   "source": [
    "Next, we need to define a Task for our experiment to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def task(input) -> str:\n",
    "    return extraction_chain.invoke(input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pwA52YTd6EdL"
   },
   "source": [
    "# Check that the task is working by running it on at least one Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "first_example = dataset.examples[0]\n",
    "first_input = first_example[\"input\"]\n",
    "\n",
    "task(first_input)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Z7QGQoSu6EdL"
   },
   "source": [
    "# Run Experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VdJn1XLQC-F9"
   },
   "source": [
    "Now we're ready to run our experiment. We'll specify our dataset and task, and generate responses for us to evaluate in the next step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = await px_client.experiments.run_experiment(dataset=dataset, task=task)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "m-DR5rrR6EdL"
   },
   "source": [
    "# Define Evaluator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ir2W6wW4DNHt"
   },
   "source": [
    "Finally, we need to define our evaluation function. Here we'll use a Jaro-Winkler similarity function that generates a score for how similar the output and expected text are. [Jaro-Winkler similarity](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance) is technique for measuring edit distance between two strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def jarowinkler_similarity(output, expected) -> float:\n",
    "    return jarowinkler.jarowinkler_similarity(\n",
    "        json.dumps(output, sort_keys=True),\n",
    "        json.dumps(expected, sort_keys=True),\n",
    "    )"
   ]
  },
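  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sketch below is a textbook Jaro-Winkler implementation in plain Python, included only to show what the score measures and why the evaluator above serializes with `sort_keys=True`. The helper names (`jaro`, `jaro_winkler`) and the sample dictionaries are made up for illustration, and the `jarowinkler` package may differ in details such as when the prefix boost applies. Because the comparison is purely textual, the same fields in a different key order would score below 1 unless the keys are sorted first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "\n",
    "def jaro(s1, s2):\n",
    "    \"\"\"Jaro similarity: shared characters within a window, penalized for transpositions.\"\"\"\n",
    "    if s1 == s2:\n",
    "        return 1.0\n",
    "    if not s1 or not s2:\n",
    "        return 0.0\n",
    "    window = max(max(len(s1), len(s2)) // 2 - 1, 0)\n",
    "    matched1 = [False] * len(s1)\n",
    "    matched2 = [False] * len(s2)\n",
    "    matches = 0\n",
    "    for i, ch in enumerate(s1):\n",
    "        lo, hi = max(0, i - window), min(len(s2), i + window + 1)\n",
    "        for j in range(lo, hi):\n",
    "            if not matched2[j] and s2[j] == ch:\n",
    "                matched1[i] = matched2[j] = True\n",
    "                matches += 1\n",
    "                break\n",
    "    if matches == 0:\n",
    "        return 0.0\n",
    "    # Matched characters that line up in a different order count as half-transpositions.\n",
    "    m1 = [c for c, hit in zip(s1, matched1) if hit]\n",
    "    m2 = [c for c, hit in zip(s2, matched2) if hit]\n",
    "    transpositions = sum(a != b for a, b in zip(m1, m2)) // 2\n",
    "    return (matches / len(s1) + matches / len(s2) + (matches - transpositions) / matches) / 3\n",
    "\n",
    "\n",
    "def jaro_winkler(s1, s2, prefix_weight=0.1):\n",
    "    \"\"\"Boost the Jaro score for strings sharing a common prefix of up to 4 characters.\"\"\"\n",
    "    score = jaro(s1, s2)\n",
    "    prefix = 0\n",
    "    for a, b in zip(s1[:4], s2[:4]):\n",
    "        if a != b:\n",
    "            break\n",
    "        prefix += 1\n",
    "    return score + prefix * prefix_weight * (1 - score)\n",
    "\n",
    "\n",
    "# Hypothetical sample data: the same fields in two different key orders.\n",
    "expected_fields = {\"sender\": \"Ana\", \"topic\": \"invoice\"}\n",
    "reordered_fields = {\"topic\": \"invoice\", \"sender\": \"Ana\"}\n",
    "\n",
    "unsorted_score = jaro_winkler(json.dumps(expected_fields), json.dumps(reordered_fields))\n",
    "sorted_score = jaro_winkler(\n",
    "    json.dumps(expected_fields, sort_keys=True),\n",
    "    json.dumps(reordered_fields, sort_keys=True),\n",
    ")\n",
    "print(f\"unsorted keys: {unsorted_score:.3f}, sorted keys: {sorted_score:.3f}\")"
   ]
  },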
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Ygypu6Ag6EdL"
   },
   "source": [
    "# Evaluate Experiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluated_experiment = await px_client.experiments.evaluate_experiment(\n",
    "    experiment=experiment, evaluators=jarowinkler_similarity\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = [\n",
    "    run.result[\"score\"]\n",
    "    for run in evaluated_experiment[\"evaluation_runs\"]\n",
    "    if run.result and \"score\" in run.result\n",
    "]\n",
    "avg_score = sum(scores) / len(scores)\n",
    "print(f\"GPT-4o average score: {avg_score:.3f} (n={len(scores)})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "kq4I-6tWDYvF"
   },
   "source": [
    "Now we have scores on how well GPT-4o does at extracting email facts. This is helpful, but doesn't mean much on its own. Let's compare it against another model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TaEaKgAN_ZJh"
   },
   "source": [
    "# Re-run with GPT 3.5 Turbo and Compare Results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GIzXZ1TBDp3J"
   },
   "source": [
    "To compare results with another model, we simply need to redefine our task. Our dataset and evaluator can stay the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = \"gpt-3.5-turbo\"\n",
    "\n",
    "llm = ChatOpenAI(model=model).bind_tools(\n",
    "    tools=[registry[dataset_name].schema],\n",
    "    tool_choice=registry[dataset_name].schema.schema()[\"title\"],\n",
    ")\n",
    "extraction_chain = registry[dataset_name].instructions | llm | output_parser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def task(input) -> str:\n",
    "    return extraction_chain.invoke(input)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "experiment = await px_client.experiments.run_experiment(dataset=dataset, task=task)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluated_experiment = await px_client.experiments.evaluate_experiment(\n",
    "    experiment=experiment, evaluators=jarowinkler_similarity\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = [\n",
    "    run.result[\"score\"]\n",
    "    for run in evaluated_experiment[\"evaluation_runs\"]\n",
    "    if run.result and \"score\" in run.result\n",
    "]\n",
    "avg_score = sum(scores) / len(scores)\n",
    "print(f\"GPT-3.5-turbo average score: {avg_score:.3f} (n={len(scores)})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TrACT8XUD18x"
   },
   "source": [
    "# Compare Results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OoXL1uTAEaAK"
   },
   "source": [
    "View your Phoenix UI to compare the models side-by-side. Higher Jaro-Winkler scores indicate better extraction accuracy. GPT-4o typically outperforms GPT-3.5-turbo on structured extraction tasks."
   ]
  },
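  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you'd also like a comparison inside the notebook, one option is to keep each model's scores in a dictionary and summarize them. This is a minimal sketch that assumes you store the `scores` list from each run under the model's name; `summarize_scores` is a hypothetical helper, and the numbers below are placeholders, not results from this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from statistics import mean\n",
    "\n",
    "\n",
    "def summarize_scores(scores_by_model):\n",
    "    \"\"\"Return per-model count, mean, and minimum, sorted by mean (highest first).\"\"\"\n",
    "    rows = [\n",
    "        {\"model\": name, \"n\": len(vals), \"mean\": mean(vals), \"min\": min(vals)}\n",
    "        for name, vals in scores_by_model.items()\n",
    "        if vals  # skip models with no scored runs\n",
    "    ]\n",
    "    return sorted(rows, key=lambda row: row[\"mean\"], reverse=True)\n",
    "\n",
    "\n",
    "# Placeholder numbers for illustration only; substitute the score lists from your own runs.\n",
    "placeholder_scores = {\"model-a\": [0.90, 0.80, 0.70], \"model-b\": [0.60, 0.95], \"model-c\": []}\n",
    "summary_rows = summarize_scores(placeholder_scores)\n",
    "for row in summary_rows:\n",
    "    print(f\"{row['model']}: mean={row['mean']:.3f} min={row['min']:.3f} n={row['n']}\")"
   ]
  },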
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iIi5yd9LsikE"
   },
   "source": [
    "From here you could try out different models or iterate on your prompt, then run the same experiment with a modified Task to compare results."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
